GENOMIC SELECTION DAIRRy-BLUP: A High-Performance Computing Approach to Genomic Prediction
نویسندگان
چکیده
In genomic prediction, common analysis methods rely on a linear mixed-model framework to estimate SNP marker effects and breeding values of animals or plants. Ridge regression–best linear unbiased prediction (RR-BLUP) is based on the assumptions that SNP marker effects are normally distributed, are uncorrelated, and have equal variances. We propose DAIRRy-BLUP, a parallel, Distributed-memory RR-BLUP implementation, based on single-trait observations (y), that uses the Average Information algorithm for restricted maximum-likelihood estimation of the variance components. The goal of DAIRRyBLUP is to enable the analysis of large-scale data sets to provide more accurate estimates of marker effects and breeding values. A distributed-memory framework is required since the dimensionality of the problem, determined by the number of SNP markers, can become too large to be analyzed by a single computing node. Initial results show that DAIRRy-BLUP enables the analysis of very large-scale data sets (up to 1,000,000 individuals and 360,000 SNPs) and indicate that increasing the number of phenotypic and genotypic records has a more significant effect on the prediction accuracy than increasing the density of SNP arrays. IN the field of genomic prediction, genotypes of animals or plants are used to predict either phenotypic properties of new crosses or estimated breeding values (EBVs) for detecting superior parents. Since quantitative traits of importance to breeders are mostly regulated by a large number of loci (QTL), high-density SNP markers are used to genotype individuals [e.g., .2000 QTL are already found to contribute to the composition and production of milk in cattle (http://www. animalgenome.org/cgi-bin/QTLdb/BT/index)]. The most frequently used SNP arrays for cattle consist of 50,000 SNP markers, but even genotypes with 700,000 SNPs are already available (Cole et al. 2012). Some widely used analysis methods rely on a linear mixedmodel backbone (Meuwissen et al. 2001) in which the SNP marker effects are modeled as random effects, drawn from a normal distribution. The predictions of the marker effects are known as best linear unbiased predictions (BLUPs), which are linear functions of the response variates. It has been shown that when no major genes contribute to the trait, Bayesian predictions and BLUP result in approximately the same prediction accuracy for the EBVs (Hayes et al. 2009; Legarra et al. 2011; Daetwyler et al. 2013). At Present the number of individuals included in the genomic prediction setting is still an order of magnitude smaller than the number of genetic markers on widely used SNP arrays, favoring algorithms that directly estimate EBVs, which is in this case computationally more efficient than first estimating the marker effects (VanRaden 2008; Misztal et al. 2009; Piepho 2009; Shen et al. 2013). Nonetheless, it has been shown theoretically (Hayes et al. 2009) that to increase the prediction accuracy of the EBVs for traits with a low heritability, the number of genotyped records should increase dramatically. Most widely used implementations like synbreed (Wimmer et al. 2012) and BLUPF90 (Misztal et al. 2002) are limited by the computational resources that can be provided by a single workstation. We present DAIRRyBLUP, a parallel framework that takes advantage of a distributed-memory compute cluster to enable the analysis of Copyright © 2014 by the Genetics Society of America doi: 10.1534/genetics.114.163683 Manuscript received March 3, 2014; accepted for publication April 10, 2014; published Early Online April 15, 2014. Supporting information is available online at http://www.genetics.org/lookup/suppl/ doi:10.1534/genetics.114.163683/-/DC1. Corresponding author: KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium. E-mail: [email protected] Genetics, Vol. 197, 813–822 July 2014 813 large-scale data sets. DAIRRy-BLUP is an acronym for a Distributed-memory implementation of the Average Information algorithm for restricted maximum-likelihood estimation of the variance components in a Ridge Regression-BLUP setting based on single-trait observations y. Additionally, results on simulated data illustrate that the use of such large-scale data sets is warranted as it significantly improves the prediction accuracy of EBVs and marker effects. DAIRRy-BLUP is optimized for processing large-scale dense data sets with SNP marker data for a large number of individuals (.100,000). Variance components are estimated based on the entire data set, using the average information restricted maximum-likelihood (AI-REML) algorithm. Therefore, it distinguishes itself from other implementations of the AI-REML algorithm that are optimized for sparse settings and are not able to make use of the aggregated compute power of a supercomputing cluster (Misztal et al. 2002; Masuda et al. 2013). All matrix equations in the algorithms used by DAIRRyBLUP are solved using a direct Cholesky decomposition. Although efficient iterative techniques exist that can reduce memory requirements and calculation time for solving a matrix equation (Legarra and Misztal 2008), the choice for a direct solver is motivated by the use of the AI-REML algorithm for variance component estimation on the entire data set. Materials and Methods
منابع مشابه
DAIRRy-BLUP: a high-performance computing approach to genomic prediction.
In genomic prediction, common analysis methods rely on a linear mixed-model framework to estimate SNP marker effects and breeding values of animals or plants. Ridge regression-best linear unbiased prediction (RR-BLUP) is based on the assumptions that SNP marker effects are normally distributed, are uncorrelated, and have equal variances. We propose DAIRRy-BLUP, a parallel, Distributed-memory RR...
متن کاملComparing Different Marker Densities and Various Reference Populations Using Pedigree-Marker Best Linear Unbiased Prediction (BLUP) Model
In order to have successful application of genomic selection, reference population and marker density should be chosen properly. This study purpose was to investigate the accuracy of genomic estimated breeding values in terms of low (5K), intermediate (50K) and high (777K) densities in the simulated populations, when different scenarios were applied about the reference populations selecting. Af...
متن کاملAccuracy of Genomic Selection Methods in a Standard Data Set of Loblolly Pine (Pinus taeda L.)
Genomic selection can increase genetic gain per generation through early selection. Genomic selection is expected to be particularly valuable for traits that are costly to phenotype and expressed late in the life cycle of long-lived species. Alternative approaches to genomic selection prediction models may perform differently for traits with distinct genetic properties. Here the performance of ...
متن کاملPredictive Ability of Statistical Genomic Prediction Methods When Underlying Genetic Architecture of Trait Is Purely Additive
A simulation study was conducted to address the issue of how purely additive (simple) genetic architecture might impact on the efficacy of parametric and non-parametric genomic prediction methods. For this purpose, we simulated a trait with narrow sense heritability h2= 0.3, with only additive genetic effects for 300 loci in order to compare the predictive ability of 14 more practically used ge...
متن کاملTitle: Accuracy of Genomic Selection Methods in a Standard Dataset of Loblolly Pine
Genomic selection can increase genetic gain per generation through early selection. Genomic selection is expected to be particularly valuable for traits that are costly to phenotype, and expressed late in the life-cycle of long-lived species. Alternative approaches to genomic selection prediction models may perform differently for traits with distinct genetic properties. Here the performance of...
متن کامل